Context:

AllLife Bank has a growing customer base. The majority of these customers are liability customers (depositors) with deposits of varying sizes. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and, in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio with a minimal budget.

As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify potential customers who have a higher probability of purchasing the loan. This will increase the success ratio while at the same time reducing the cost of the campaign.

Objective:

To predict whether a liability customer will buy a personal loan or not, identify which variables are most significant, and determine which segment of customers should be targeted more.

Dataset

Import the necessary packages

Read the dataset

View the first and last 10 rows of the dataset.

Understand the shape of the dataset.

Check for duplicate rows and remove them if any are found.

Check the data types of the columns for the dataset.

Check for missing values

Give a statistical summary for the dataset.
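The inspection steps above can be sketched as follows. The DataFrame here is a small synthetic stand-in (column names and values are illustrative, not the real dataset); in the notebook, `df` would instead come from `pd.read_csv` on the actual CSV file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the bank dataset; in practice, load the real CSV
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Age": rng.integers(23, 67, 200),
    "Income": rng.integers(10, 200, 200),
    "Personal_Loan": (rng.random(200) < 0.09).astype(int),
})

print(df.head(10))         # first 10 rows
print(df.tail(10))         # last 10 rows
print(df.shape)            # (rows, columns)

df = df.drop_duplicates()  # remove exact duplicate rows, if any

print(df.dtypes)           # data type of each column
print(df.isnull().sum())   # missing values per column
print(df.describe().T)     # statistical summary of numeric columns
```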

Most features are boolean type.

EDA

Univariate analysis

Age and Experience look uniformly distributed. Income and CCAvg look right-skewed. Most other features are boolean.

CCAvg and Mortgage have many outliers. We will leave them in the dataset.

Observations on Numerical countplots

Most of the features have a uniform distribution. Mortgage has a right-skewed distribution.

Observations on Correlation with Heatmap

Data Preparation

Model Building - Approach

  1. Data preparation
  2. Partition the data into train and test set.
  3. Build a CART model on the train data.
  4. Tune the model and prune the tree, if required.
  5. Evaluate the model on the test set.

Split Data
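A minimal sketch of the split, using a synthetic stand-in for the prepared frame (the column names here are illustrative). Stratifying on the target preserves the ~9% positive rate in both partitions, which matters for an imbalanced dataset like this one; the 70/30 split ratio is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame standing in for the prepared dataset
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Income": rng.integers(10, 200, 500),
    "Education": rng.integers(1, 4, 500),
    "Personal_Loan": (rng.random(500) < 0.09).astype(int),
})

X = df.drop(columns="Personal_Loan")
y = df["Personal_Loan"]

# stratify=y keeps the class proportions the same in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)
print(X_train.shape, X_test.shape)
```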

Logistic Regression Model

There is not much change in recall, so let's drop Family.

Recall on the train set is the same, so lg3 is the final model that we will use for predictions and inference. Using model lg3 for interpretation, Education and CD_account are the most important variables.
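The drop-and-refit step can be sketched as below. The training data here is synthetic and the exact feature set is an assumption; the point is the pattern of refitting without Family (yielding the model the text calls lg3) and checking that recall is unchanged.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Hypothetical training data; in the notebook these come from the split above
rng = np.random.default_rng(0)
X_train = pd.DataFrame({
    "Income": rng.normal(75, 40, 500),
    "Education": rng.integers(1, 4, 500),
    "CD_Account": rng.integers(0, 2, 500),
    "Family": rng.integers(1, 5, 500),
})
y_train = (0.02 * X_train["Income"] + X_train["CD_Account"]
           + rng.normal(0, 1, 500) > 3).astype(int)

# Fit on all features, then refit after dropping Family and compare recall
lg = LogisticRegression(max_iter=1000).fit(X_train, y_train)
r_full = recall_score(y_train, lg.predict(X_train))

lg3 = LogisticRegression(max_iter=1000).fit(
    X_train.drop(columns="Family"), y_train)
r_drop = recall_score(y_train, lg3.predict(X_train.drop(columns="Family")))
print(r_full, r_drop)
```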

Check Model Performance

Try to improve Recall using AUC-ROC curve

Decreasing the threshold beyond 0.2 will lead to a fast decrease in precision, which would mean a great loss of opportunity, so let's use a threshold of 0.2.
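The threshold adjustment can be illustrated as follows, on a synthetic stand-in for the data (in the notebook, the probabilities would come from lg3). Lowering the classification threshold from the default 0.5 to 0.2 can only add predicted positives, so recall never decreases.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, recall_score

# Synthetic stand-in for the training data
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + rng.normal(0, 1, 500) > 1.3).astype(int)

model = LogisticRegression().fit(X, y)
proba = model.predict_proba(X)[:, 1]

# roc_curve sweeps all thresholds; the curve guides the choice of cutoff
fpr, tpr, thresholds = roc_curve(y, proba)

r_default = recall_score(y, (proba >= 0.5).astype(int))
r_02 = recall_score(y, (proba >= 0.2).astype(int))
print(r_default, r_02)  # the lower threshold raises recall
```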

Build Decision Tree Model

We only have about 9% positive classes, so even a model that marks every sample as negative would achieve roughly 90% accuracy; hence accuracy is not a good metric to evaluate here.
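A quick demonstration of this point, with labels simulated at the ~9% positive rate described above: an always-negative "model" scores high on accuracy while missing every positive.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Labels simulated with ~9% positives, mirroring the dataset's imbalance
rng = np.random.default_rng(1)
y_true = (rng.random(5000) < 0.09).astype(int)

# A useless "model" that predicts the negative class for every sample
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred)
print(acc, rec)  # high accuracy, zero recall
```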

Insights:

Misclassification analysis

The data imbalance is present in the sample and in the population itself, as only a small percentage of customers actually sign up for targeted schemes such as a personal loan. So the model itself is not doing a poor job of classification; the false positives and false negatives (Type I and Type II errors) are actually fairly low (~1%) in the confusion matrix above.

Visualizing the Decision Tree

According to the decision tree model, Zipcode is the most important variable for predicting potential loan customers. This could be because certain zipcodes contain households with high income and high education levels, whose residents also have high credit card usage.
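A sketch of how the tree and its variable importances can be inspected, on synthetic data (the column names and fitted splits here are illustrative, so the importances will not match the notebook's). `export_text` gives a text rendering of the tree; `plot_tree` produces the graphical version shown in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features; in the notebook, use the real training data
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "Income": rng.normal(75, 40, 400),
    "ZIPCode": rng.integers(90000, 96700, 400),
    "Education": rng.integers(1, 4, 400),
})
y = (X["Income"] > 100).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X, y)

# Text rendering of the fitted tree's split rules
print(export_text(tree, feature_names=list(X.columns)))

# Relative importance of each feature in the fitted tree (sums to 1)
print(pd.Series(tree.feature_importances_, index=X.columns))
```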

Reducing overfitting

Using GridSearch for Hyperparameter tuning of our tree model
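The grid search can be sketched as below, on synthetic data. The grid values are illustrative assumptions, not the notebook's exact grid; scoring on recall matches the business objective of not missing likely loan buyers.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training split
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, 400) > 0.8).astype(int)

# Illustrative grid; tune the ranges to the real data
param_grid = {
    "max_depth": [3, 5, 7, None],
    "min_samples_leaf": [1, 5, 10],
    "max_leaf_nodes": [10, 20, None],
}
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_grid,
    scoring="recall",  # optimize recall, per the business objective
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
best_tree = grid.best_estimator_
```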

Hyperparameter tuning has reduced model overfitting, as seen in the closer train and test recall, and we now have a more generalized model.

Visualizing the Decision Tree

Cost Complexity Pruning

The DecisionTreeClassifier provides parameters such as min_samples_leaf and max_depth to prevent a tree from overfitting. Cost complexity pruning provides another option to control the size of a tree. In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha. Greater values of ccp_alpha increase the number of nodes pruned. Here we only show the effect of ccp_alpha on regularizing the trees and how to choose a ccp_alpha based on validation scores.

Total impurity of leaves vs effective alphas of pruned tree

Minimal cost complexity pruning recursively finds the node with the "weakest link". The weakest link is characterized by an effective alpha, where the nodes with the smallest effective alpha are pruned first. To get an idea of what values of ccp_alpha could be appropriate, scikit-learn provides DecisionTreeClassifier.cost_complexity_pruning_path that returns the effective alphas and the corresponding total leaf impurities at each step of the pruning process. As alpha increases, more of the tree is pruned, which increases the total impurity of its leaves.

Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.

For the remainder, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and tree depth decreases as alpha increases.
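The pruning-path procedure described above can be sketched as follows, on synthetic data standing in for the training split.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training split
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = (X[:, 0] + rng.normal(0, 0.7, 400) > 0.9).astype(int)

# Effective alphas and total leaf impurities along the pruning path
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X, y)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

# One tree per effective alpha; larger alpha means more aggressive pruning
clfs = [DecisionTreeClassifier(random_state=1, ccp_alpha=a).fit(X, y)
        for a in ccp_alphas]

trivial_nodes = clfs[-1].tree_.node_count
print(trivial_nodes)  # the last alpha prunes the tree down to a single node

# Drop the trivial single-node tree before comparing node counts and depths
clfs, ccp_alphas = clfs[:-1], ccp_alphas[:-1]
```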

The maximum value of recall is at alpha = 0.014, but if we choose it, the decision tree will only have a root node and we would lose the business rules; instead we can choose alpha = 0.006, retaining information while still getting a high recall.

Visualizing the Decision Tree

Comparing Baseline Logistic Regression and all the decision tree models

The decision tree model with post-pruning has given the best recall score on the data.

Conclusions

Recommendations